
[TRTLLM-11037][bug] Fix MoE DeepEP hang caused by non-deterministic GC#12060

Open
xxi-nv wants to merge 1 commit into NVIDIA:main from xxi-nv:fix_moe_bugs

Conversation


@xxi-nv xxi-nv commented Mar 10, 2026

Summary

  • Fix MoE DeepEP hang caused by non-deterministic garbage collection timing across ranks
  • DeepEP Buffer.__del__ calls intranode::barrier (a collective op). Without explicit synchronous release, GC timing differences across ranks cause some ranks to block in the barrier indefinitely
  • Add destroy() method to Communication base class and DeepEP/DeepEPLowLatency subclasses for explicit buffer release
  • Call destroy() in ConfigurableMoE when falling back from DeepEP to AllGatherReduceScatter
  • Make ConfigurableMoE a context manager so resources are released on scope exit
  • Re-enable DEEPEPLOWLATENCY multi-GPU tests on H100 that were previously skipped due to this hang
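The explicit-release pattern the bullets above describe can be sketched as follows. This is a minimal illustration, not the actual TensorRT-LLM implementation: class names, the `buffer` attribute, and the stand-in resource are all hypothetical.

```python
# Sketch of the explicit destroy() + context-manager pattern described
# above. Names are illustrative; the real classes live in
# tensorrt_llm/_torch/modules/fused_moe/.

class Communication:
    """Base class: destroy() is a safe no-op unless a subclass holds resources."""

    def destroy(self):
        pass


class DeepEPLike(Communication):
    """Holds a buffer whose destructor performs a collective barrier.

    Relying on __del__ is unsafe here: GC runs at different times on
    different ranks, so a collective barrier inside __del__ can block
    some ranks forever. destroy() releases the buffer at a program
    point every rank reaches together.
    """

    def __init__(self):
        self.buffer = object()  # stand-in for the DeepEP buffer

    def destroy(self):
        # Drop the reference deterministically so the (collective)
        # teardown happens at a known point on every rank.
        self.buffer = None


class ConfigurableMoELike:
    """Context-manager wrapper so resources are freed on scope exit."""

    def __init__(self, comm):
        self.comm = comm

    def destroy(self):
        if self.comm is not None:
            self.comm.destroy()
            self.comm = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False  # do not swallow exceptions


comm = DeepEPLike()
with ConfigurableMoELike(comm) as moe:
    pass  # forward passes would go here
assert moe.comm is None and comm.buffer is None
```

The same `destroy()` call is what the fallback path uses: before swapping the strategy for AllGatherReduceScatter, the old strategy is destroyed explicitly instead of being left to the garbage collector.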

Test plan

  • Unit tests updated to use context manager pattern (with create_moe(...) as fused_moe)
  • Multi-GPU MoE tests with DEEPEPLOWLATENCY on H100 (previously hanging, now expected to pass)
  • CI validation
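The test-plan pattern above can be illustrated with a toy factory. `create_moe` here is a hypothetical stand-in for the test utility named in the plan; only the shape of the usage (`with create_moe(...) as fused_moe`) mirrors the updated tests.

```python
# Toy sketch of the context-manager test pattern; create_moe is a
# hypothetical stand-in, not the real test-utility factory.
from contextlib import contextmanager

@contextmanager
def create_moe(**kwargs):
    moe = {"destroyed": False, **kwargs}
    try:
        yield moe
    finally:
        # Cleanup runs on scope exit even if the test body raises,
        # which is what prevents lingering buffers between tests.
        moe["destroyed"] = True

with create_moe(num_experts=8) as fused_moe:
    assert not fused_moe["destroyed"]
assert fused_moe["destroyed"]
```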

Summary by CodeRabbit

  • New Features

    • MoE communication components now support context manager syntax (with statement) for streamlined resource management.
    • Added explicit lifecycle control methods to communication layer components.
  • Tests

    • Enhanced test coverage for communication subsystem reliability and low-latency scenarios.

Fix multi-GPU hangs caused by non-deterministic GC of DeepEP buffers by
adding explicit destroy() to Communication classes and calling it before
fallback to AllGatherReduceScatter.

Also add explicit comm resource cleanup in test teardown to prevent
inter-test barrier deadlock from lingering DeepEP buffers.

Remove the now-unnecessary DEEPEPLOWLATENCY skip on H100 (SM90).

Signed-off-by: xxi <xxi@nvidia.com>
@xxi-nv xxi-nv requested a review from a team as a code owner March 10, 2026 03:10
@xxi-nv xxi-nv requested a review from HuiGao-NV March 10, 2026 03:10
@xxi-nv xxi-nv requested review from QiJune and leslie-fang25 and removed request for HuiGao-NV March 10, 2026 03:13

xxi-nv commented Mar 10, 2026

/bot run


coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 843137a8-c68d-4849-99c3-18560b46e674

📥 Commits

Reviewing files that changed from the base of the PR and between 8d8b84b and 140c9d2.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/modules/fused_moe/communication/base.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
  • tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
  • tests/unittest/_torch/modules/moe/moe_test_utils.py
  • tests/unittest/_torch/modules/moe/test_moe_module.py
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/modules/moe/moe_test_utils.py

📝 Walkthrough

Walkthrough

This pull request adds lifecycle management to TensorRT-LLM's fused MoE communication strategies. A destroy() method is introduced across communication classes (base, DeepEP, DeepEPLowLatency) to explicitly release resources. ConfigurableMoE gains context manager support (__enter__, __exit__) and calls destroy() when switching communication strategies. Test utilities are updated to remove a device-specific skip condition and adjust context manager syntax.

Changes

  • Communication Strategy Lifecycle — tensorrt_llm/_torch/modules/fused_moe/communication/base.py, tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py, tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
    Added a non-abstract destroy() method to the base Communication class and implemented it in the DeepEP and DeepEPLowLatency subclasses to explicitly release internal buffers and prevent the deadlock/hang from barrier synchronization during garbage collection.
  • ConfigurableMoE Lifecycle & Strategy Management — tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
    Implemented the context manager protocol (__enter__, __exit__) and a destroy() method for ConfigurableMoE. Modified determine_communication_method() to call destroy() on the existing communication strategy before replacing it with AllGatherReduceScatter.
  • Test Utilities & Module Updates — tests/unittest/_torch/modules/moe/moe_test_utils.py, tests/unittest/_torch/modules/moe/test_moe_module.py
    Removed the device-capability skip rule for DEEPEPLOWLATENCY on H100 (SM90). Updated test context manager syntax from nested with blocks to the single-statement (tuple-based) form for improved resource management.
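The "tuple-based" with form mentioned above refers to combining multiple context managers in one statement instead of nesting them. A minimal sketch using stdlib `nullcontext` (the parenthesized multi-line variant of this syntax requires Python 3.10+):

```python
# Nested vs. single-statement context managers; nullcontext stands in
# for the real MoE/test context managers.
from contextlib import nullcontext

# Nested form (old style):
with nullcontext("a") as x:
    with nullcontext("b") as y:
        pair_nested = (x, y)

# Single-statement ("tuple-based") form, as in the updated tests:
with nullcontext("a") as x, nullcontext("b") as y:
    pair_flat = (x, y)

assert pair_nested == pair_flat == ("a", "b")
```

Both forms run the same __enter__/__exit__ calls; the flat form simply reduces indentation and keeps the managers' lifetimes visually paired.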

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 73.33%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the PR title clearly and concisely summarizes the main change (fixing a MoE DeepEP hang caused by non-deterministic GC) with the proper ticket reference [TRTLLM-11037] and type [bug].
  • Description check — ✅ Passed: the description includes all critical sections (Summary, Test plan with checkboxes) and covers the what/why of the changes. While some non-critical template sections (PR Checklist) are absent, the description is substantially complete and follows the intended template structure.


@tensorrt-cicd

PR_Github #38362 [ run ] triggered by Bot. Commit: 140c9d2 Link to invocation

@tensorrt-cicd

PR_Github #38362 [ run ] completed with state SUCCESS. Commit: 140c9d2
/LLM/main/L0_MergeRequest_PR pipeline #29731 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
